style prompt
Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models
Peng, Yizhou, Ma, Yukun, Zhang, Chong, Chao, Yi-Wen, Ni, Chongjia, Ma, Bin
While Text-to-Speech (TTS) systems can achieve fine-grained control over emotional expression via natural language prompts, a significant challenge emerges when the desired emotion (style prompt) conflicts with the semantic content of the text. This mismatch often results in unnatural-sounding speech, undermining the goal of achieving fine-grained emotional control. Classifier-Free Guidance (CFG) is a key technique for enhancing prompt alignment; however, its application to auto-regressive (AR) TTS models remains underexplored, and applying it naively can degrade audio quality. This paper directly addresses the challenge of style-content mismatch in AR TTS models by proposing an adaptive CFG scheme that adjusts to different levels of the detected mismatch, as measured using large language models or natural language inference models. This solution is based on a comprehensive analysis of CFG's impact on emotional expressiveness in state-of-the-art AR TTS models. Our results demonstrate that the proposed adaptive CFG scheme improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility.
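To make the adaptive CFG idea concrete, here is a minimal sketch in which a style-content mismatch score (e.g. from an NLI or LLM judge) sets the guidance scale used to combine conditional and unconditional logits at each AR decoding step. The mismatch stub, the linear schedule, and all function names are illustrative assumptions rather than the paper's exact formulation.

```python
# Hypothetical sketch of mismatch-adaptive classifier-free guidance (CFG)
# for an auto-regressive TTS decoder.
import numpy as np


def mismatch_score(style_prompt: str, text: str) -> float:
    """Placeholder: return a style-content mismatch score in [0, 1].

    In practice this could be 1 - P(entailment) from an NLI model, or a
    rating elicited from an LLM; here it is simply stubbed out.
    """
    return 0.5  # stand-in value


def adaptive_guidance_scale(mismatch: float, lo: float = 1.0, hi: float = 3.0) -> float:
    """Map a mismatch score to a CFG scale: the larger the conflict between
    style prompt and text, the stronger the guidance toward the style prompt."""
    return lo + (hi - lo) * float(np.clip(mismatch, 0.0, 1.0))


def cfg_logits(cond_logits: np.ndarray, uncond_logits: np.ndarray, scale: float) -> np.ndarray:
    """Standard CFG combination applied per decoding step:
    l = l_uncond + scale * (l_cond - l_uncond)."""
    return uncond_logits + scale * (cond_logits - uncond_logits)


# Toy usage: one decoding step over a vocabulary of 8 speech tokens.
rng = np.random.default_rng(0)
cond, uncond = rng.normal(size=8), rng.normal(size=8)
scale = adaptive_guidance_scale(mismatch_score("angry", "What a lovely morning."))
next_token = int(np.argmax(cfg_logits(cond, uncond, scale)))
print(scale, next_token)
```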
P2VA: Converting Persona Descriptions into Voice Attributes for Fair and Controllable Text-to-Speech
Lee, Yejin, Kang, Jaehoon, Shim, Kyuhong
While persona-driven large language models (LLMs) and prompt-based text-to-speech (TTS) systems have advanced significantly, a usability gap arises when users attempt to generate voices matching their desired personas from implicit descriptions. Most users lack specialized knowledge to specify detailed voice attributes, which often leads TTS systems to misinterpret their expectations. To address this gap, we introduce Persona-to-Voice-Attribute (P2VA), the first framework that automatically generates voices from persona descriptions. Our approach employs two strategies: P2VA-C for structured voice attributes, and P2VA-O for richer style descriptions. Evaluation shows that P2VA-C reduces WER by 5% and improves MOS by 0.33 points. To the best of our knowledge, P2VA is the first framework to establish a connection between persona and voice synthesis. In addition, we discover that current LLMs embed societal biases in voice attributes during the conversion process. Our experiments and findings further provide insights into the challenges of building persona-voice systems.
- Asia > Middle East > Saudi Arabia > Asir Province > Abha (0.05)
- Asia > India > Maharashtra > Mumbai (0.05)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
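A rough sketch of the persona-to-voice-attribute conversion P2VA describes: an LLM is prompted to map a free-form persona description to a small set of structured voice attributes, which are then flattened into a TTS style prompt. The attribute schema, prompt template, and the call_llm() stub are hypothetical stand-ins, not the released P2VA pipeline.

```python
# Illustrative sketch of converting a persona description into structured
# voice attributes (in the spirit of P2VA-C). All names are hypothetical.
import json

VOICE_ATTRIBUTES = ["gender", "age", "pitch", "speaking_rate", "energy", "emotion"]

PROMPT_TEMPLATE = (
    "Given the persona below, fill in the voice attributes "
    f"{VOICE_ATTRIBUTES} as a JSON object with short string values.\n"
    "Persona: {persona}\nJSON:"
)


def call_llm(prompt: str) -> str:
    """Stand-in for any instruction-following LLM call; returns a canned reply."""
    return json.dumps({
        "gender": "female", "age": "middle-aged", "pitch": "medium",
        "speaking_rate": "slow", "energy": "calm", "emotion": "warm",
    })


def persona_to_attributes(persona: str) -> dict:
    raw = call_llm(PROMPT_TEMPLATE.format(persona=persona))
    attrs = json.loads(raw)
    # Keep only the attributes the downstream TTS style prompt understands.
    return {k: v for k, v in attrs.items() if k in VOICE_ATTRIBUTES}


def attributes_to_style_prompt(attrs: dict) -> str:
    """Flatten structured attributes into a natural-language style prompt."""
    return ", ".join(f"{k.replace('_', ' ')}: {v}" for k, v in attrs.items())


print(attributes_to_style_prompt(persona_to_attributes(
    "A retired librarian who tells bedtime stories to her grandchildren.")))
```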
Yan: Foundational Interactive Video Generation
Ye, Deheng, Zhou, Fangyun, Lv, Jiacheng, Ma, Jianqi, Zhang, Jun, Lv, Junyan, Li, Junyou, Deng, Minwen, Yang, Mingyu, Fu, Qiang, Yang, Wei, Lv, Wenkai, Yu, Yangbin, Wang, Yewen, Guan, Yonghang, Hu, Zhihao, Fang, Zhongbin, Sun, Zhongqian
We present Yan, a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. Specifically, Yan comprises three core modules. AAA-level Simulation: We design a highly-compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process, achieving real-time 1080P/60FPS interactive simulation. Multi-Modal Generation: We introduce a hierarchical autoregressive caption method that injects game-specific knowledge into open-domain multi-modal video diffusion models (VDMs), and then transform the VDM into a frame-wise, action-controllable, real-time, infinite interactive video generator. Notably, when the textual and visual prompts are sourced from different domains, the model demonstrates strong generalization, allowing it to blend and compose the style and mechanics across domains flexibly according to user prompts. Multi-Granularity Editing: We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text. Collectively, Yan offers an integration of these modules, pushing interactive video generation beyond isolated capabilities toward a comprehensive AI-driven interactive creation paradigm, paving the way for the next generation of creative tools, media, and entertainment. The project page is: https://greatx3.github.io/Yan/.
- Research Report (0.50)
- Workflow (0.46)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
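Purely as a schematic, the sketch below shows frame-wise interactive generation with a bounded cache of recent latent frames, loosely in the spirit of Yan's KV-cache-based shift-window inference; the denoise stub, shapes, and window size are all illustrative assumptions and not the Yan architecture.

```python
# Very schematic sketch of frame-wise interactive generation with a bounded
# context window. Everything here is an illustrative stand-in.
from collections import deque

import numpy as np


def denoise_next_frame(context: np.ndarray, action: int) -> np.ndarray:
    """Stand-in for a denoising step conditioned on cached context frames
    and the user's current action; a real model would run a network here."""
    rng = np.random.default_rng(action)
    return context.mean(axis=0) * 0.9 + 0.1 * rng.normal(size=context.shape[1])


LATENT_DIM = 16
WINDOW = 8                       # number of cached frames (the "shift window")
cache = deque([np.zeros(LATENT_DIM)], maxlen=WINDOW)

for action in [0, 1, 1, 2, 0]:   # a short stream of user actions
    frame = denoise_next_frame(np.stack(cache), action)
    cache.append(frame)          # oldest frame is evicted once the window is full

print(len(cache), cache[-1].shape)
```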
AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis
Luo, Dan, Ma, Chengyuan, Li, Weiqin, Wang, Jun, Chen, Wei, Wu, Zhiyong
With the advancement of speech synthesis technology, users have higher expectations for the naturalness and expressiveness of synthesized speech. However, previous research has largely overlooked the importance of prompt selection. This study proposes a text-to-speech (TTS) framework based on Retrieval-Augmented Generation (RAG) technology, which can dynamically adjust the speech style according to the text content to achieve more natural and vivid communication effects. We have constructed a speech style knowledge database containing high-quality speech samples in various contexts and developed a style matching scheme. This scheme uses embeddings extracted by Llama, PER-LLM-Embedder, and Moka to match against samples in the knowledge database, selecting the most appropriate speech style for synthesis. Furthermore, our empirical research validates the effectiveness of the proposed method. Our demo can be viewed at: https://thuhcsi.github.io/icme2025-AutoStyle-TTS
- Asia > China > Guangdong Province > Shenzhen (0.05)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
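A minimal sketch of the retrieval step described above: embed the input text, compare it against a small style knowledge database, and return the closest reference for synthesis. A generic sentence-transformers model stands in for the Llama, PER-LLM-Embedder, and Moka embedders used in the paper, and the database entries are made up for illustration.

```python
# Minimal sketch of RAG-style prompt selection for TTS.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Each database entry pairs a transcript (for matching) with a style reference.
style_db = [
    {"text": "I can't believe we won the championship!", "style": "excited_ref.wav"},
    {"text": "Let us quietly observe a moment of silence.", "style": "solemn_ref.wav"},
    {"text": "Once upon a time, in a faraway land...", "style": "storytelling_ref.wav"},
]
db_vecs = embedder.encode([e["text"] for e in style_db], normalize_embeddings=True)


def select_style(text: str) -> dict:
    """Return the database entry whose transcript embedding is closest."""
    q = embedder.encode([text], normalize_embeddings=True)[0]
    sims = db_vecs @ q              # cosine similarity (vectors are normalized)
    return style_db[int(np.argmax(sims))]


print(select_style("And so the little fox set off on a long journey.")["style"])
```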
Scaling Rich Style-Prompted Text-to-Speech Datasets
Diwan, Anuj, Zheng, Zhisheng, Harwath, David, Choi, Eunsol
We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers, and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at https://github.com/ajd12342/paraspeechcaps.
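As a hedged illustration of automatic tag scaling, the sketch below aggregates scores from several off-the-shelf annotators and keeps a rich style tag only when the combined score clears a threshold. The scorer stubs, tag list, and threshold are assumptions, not the actual ParaSpeechCaps pipeline.

```python
# Hedged sketch of automatic rich-tag annotation by aggregating weak scorers.
from statistics import mean

RICH_TAGS = ["guttural", "nasal", "pained", "whispery", "booming"]


def embedder_score(utterance_id: str, tag: str) -> float:
    """Stand-in for cosine similarity between a speech embedding and the
    text embedding of the tag description."""
    return 0.7 if (utterance_id, tag) == ("utt_001", "whispery") else 0.2


def audio_lm_score(utterance_id: str, tag: str) -> float:
    """Stand-in for an audio language model asked whether the tag applies."""
    return 0.8 if (utterance_id, tag) == ("utt_001", "whispery") else 0.1


def annotate(utterance_id: str, threshold: float = 0.5) -> list[str]:
    kept = []
    for tag in RICH_TAGS:
        score = mean([embedder_score(utterance_id, tag), audio_lm_score(utterance_id, tag)])
        if score >= threshold:
            kept.append(tag)
    return kept


print(annotate("utt_001"))   # -> ['whispery']
```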
Gender Bias in Instruction-Guided Speech Synthesis Models
Recent advancements in controllable expressive speech synthesis, especially in text-to-speech (TTS) models, have allowed for the generation of speech with specific styles guided by textual descriptions, known as style prompts. While this development enhances the flexibility and naturalness of synthesized speech, there remains a significant gap in understanding how these models handle vague or abstract style prompts. This study investigates the potential gender bias in how models interpret occupation-related prompts, specifically examining their responses to instructions like "Act like a nurse". We explore whether these models exhibit tendencies to amplify gender stereotypes when interpreting such prompts. Our experimental results reveal the model's tendency to exhibit gender bias for certain occupations. Moreover, models of different sizes show varying degrees of this bias across these occupations.
- North America > United States > New York > New York County > New York City (0.14)
- Asia > Taiwan (0.04)
- Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.94)
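The kind of probe this study describes can be sketched as follows: synthesize many samples for an occupation-style prompt, classify the perceived gender of each sample, and inspect the resulting distribution. Both synthesize() and perceived_gender() are hypothetical stand-ins for a TTS model and a speaker-attribute classifier.

```python
# Toy sketch of measuring gender skew in instruction-guided TTS outputs.
import hashlib
from collections import Counter


def synthesize(style_prompt: str, seed: int) -> bytes:
    """Stand-in for an instruction-guided TTS model."""
    return f"{style_prompt}-{seed}".encode()


def perceived_gender(audio: bytes) -> str:
    """Stand-in for a speaker-gender classifier; deliberately skewed
    toward 'female' to mimic a biased model in this toy example."""
    return "female" if int(hashlib.md5(audio).hexdigest(), 16) % 100 < 85 else "male"


def gender_distribution(style_prompt: str, n: int = 200) -> dict:
    counts = Counter(perceived_gender(synthesize(style_prompt, i)) for i in range(n))
    return {g: c / n for g, c in counts.items()}


print(gender_distribution("Act like a nurse"))   # e.g. {'female': ~0.85, 'male': ~0.15}
```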
LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning
Kawamura, Masaya, Yamamoto, Ryuichi, Shirahata, Yuma, Hasumi, Takuya, Tachibana, Kentaro
We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We employ a hybrid approach to construct prompt annotations: (1) manual annotations that capture human perceptions of speaker characteristics and (2) synthetic annotations on speaking style. Compared to existing English prompt datasets, our corpus provides more diverse prompt annotations for all speakers of LibriTTS-R. Experimental results for prompt-based controllable TTS demonstrate that the TTS model trained with LibriTTS-P achieves higher naturalness than the model using the conventional dataset. Furthermore, the results for style captioning tasks show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset. Our corpus, LibriTTS-P, is available at https://github.com/line/LibriTTS-P.
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Switzerland > Vaud (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.52)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.42)
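A small illustrative sketch of how the two annotation levels in LibriTTS-P could be combined at training time: a speaker-level prompt describing speaker characteristics is joined with an utterance-level speaking-style prompt into one conditioning string. The field names and the joining template are assumptions, not the corpus's exact format.

```python
# Illustrative combination of speaker-level and utterance-level prompts.
speaker_prompts = {"spk_103": "an adult woman with a slightly deep, calm voice"}

utterance_rows = [
    {"utt_id": "103_0001", "speaker": "spk_103", "style_prompt": "speaks slowly and softly"},
    {"utt_id": "103_0002", "speaker": "spk_103", "style_prompt": "speaks quickly with high energy"},
]


def build_prompt(row: dict) -> str:
    """Join speaker identity and speaking style into one natural-language prompt."""
    return f"{speaker_prompts[row['speaker']]}, who {row['style_prompt']}"


for row in utterance_rows:
    print(row["utt_id"], "->", build_prompt(row))
```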
PromptASR for contextualized ASR with controllable style
Yang, Xiaoyu, Kang, Wei, Yao, Zengwei, Yang, Yifan, Guo, Liyong, Kuang, Fangjun, Lin, Long, Povey, Daniel
Prompts are crucial to large language models as they provide context information such as topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts into end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with a controllable style of transcriptions. Specifically, a dedicated text encoder encodes the text prompts, and the encodings are injected into the speech encoder by cross-attending the features from the two modalities. When using the ground-truth text from preceding utterances as the content prompt, the proposed system achieves 21.9% and 6.8% relative word error rate reductions on a book reading dataset and an in-house dataset compared to a baseline ASR system. The system can also take word-level biasing lists as prompts to improve recognition accuracy on rare words. An additional style prompt can be given to the text encoder to guide the ASR system to output different styles of transcriptions. The code is available at icefall.
- Europe > Spain > Galicia > Madrid (0.05)
- North America > United States > New York (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
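The prompt-injection mechanism described above can be sketched as a speech encoder layer that cross-attends to the encoded text prompt. Dimensions, layer layout, and module names below are illustrative assumptions rather than the icefall implementation.

```python
# Hedged sketch of injecting text-prompt encodings into a speech encoder
# via cross-attention, in the spirit of PromptASR.
import torch
import torch.nn as nn


class PromptedSpeechEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention: speech features attend to the encoded text prompt.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, speech: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        x = self.norm1(speech + self.self_attn(speech, speech, speech)[0])
        x = self.norm2(x + self.cross_attn(x, prompt, prompt)[0])   # inject prompt info
        return self.norm3(x + self.ffn(x))


# Toy usage: 100 speech frames attend to a 12-token encoded text prompt.
layer = PromptedSpeechEncoderLayer()
speech_feats = torch.randn(2, 100, 256)     # (batch, frames, dim)
prompt_feats = torch.randn(2, 12, 256)      # output of a text prompt encoder
print(layer(speech_feats, prompt_feats).shape)   # torch.Size([2, 100, 256])
```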
PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions
Shimizu, Reo, Yamamoto, Ryuichi, Kawamura, Masaya, Shirahata, Yuma, Doi, Hironori, Komatsu, Tatsuya, Tachibana, Kentaro
We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of speaking style. Since there is no large-scale dataset containing speaker prompts, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only a limited aspect of speaker individuality, such as pitch, speaking speed, and energy, our method utilizes an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method can better control speaker characteristics than the methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.
- North America > Canada > Quebec > Montreal (0.04)
- Europe > United Kingdom > England > West Midlands > Birmingham (0.04)
- Asia > Japan > Honshū > Tōhoku (0.04)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.74)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
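To illustrate the mixture density network component mentioned above, here is a hedged sketch of an MDN head that maps a pooled prompt encoding to mixture weights, means, and variances over a speaker-factor vector, then samples from one component. The sizes and module names are assumptions for illustration only.

```python
# Illustrative mixture density network (MDN) head over speaker factors.
import torch
import torch.nn as nn


class MDNHead(nn.Module):
    def __init__(self, d_in: int = 256, d_out: int = 64, n_mix: int = 8):
        super().__init__()
        self.n_mix, self.d_out = n_mix, d_out
        self.proj = nn.Linear(d_in, n_mix * (1 + 2 * d_out))  # logits, means, log-variances

    def forward(self, h: torch.Tensor):
        p = self.proj(h).view(-1, self.n_mix, 1 + 2 * self.d_out)
        logits = p[..., 0]                        # mixture logits
        mu = p[..., 1:1 + self.d_out]             # component means
        log_var = p[..., 1 + self.d_out:]         # component log-variances
        return logits, mu, log_var

    @torch.no_grad()
    def sample(self, h: torch.Tensor) -> torch.Tensor:
        logits, mu, log_var = self(h)
        k = torch.distributions.Categorical(logits=logits).sample()   # pick a component
        idx = k.view(-1, 1, 1).expand(-1, 1, self.d_out)
        mu_k = mu.gather(1, idx).squeeze(1)
        std_k = (0.5 * log_var.gather(1, idx).squeeze(1)).exp()
        return mu_k + std_k * torch.randn_like(std_k)                 # draw from that Gaussian


head = MDNHead()
prompt_encoding = torch.randn(4, 256)            # e.g. pooled speaker-prompt embedding
print(head.sample(prompt_encoding).shape)        # torch.Size([4, 64])
```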
Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations
Zhang, Hanglei, Guo, Yiwei, Liu, Sen, Chen, Xie, Yu, Kai
Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes. Recent advancements in expressive TTS empower users with the ability to directly control synthesis style through natural language prompts. However, these methods often require excessive training with a significant amount of style-annotated data, which can be challenging to acquire. Moreover, they may have limited adaptability due to fixed style annotations. In this work, we present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations. Our approach utilizes a large language model (LLM) to transform expressive TTS into a style retrieval task. The LLM selects the best-matching style references from annotated utterances based on external style prompts, which can be raw input text or natural language style descriptions. The selected reference guides the TTS pipeline to synthesize speech with the intended style. This innovative approach provides flexible, versatile, and precise style control with minimal human workload. Experiments on a Mandarin storytelling corpus demonstrate FS-TTS's proficiency in leveraging the LLM's semantic inference ability to retrieve desired styles from either input text or user-defined descriptions. This results in synthesized speech that is closely aligned with the specified styles.
- Asia > China > Shanghai > Shanghai (0.04)
- Africa > South Africa > Western Cape > Cape Town (0.04)
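A minimal sketch of LLM-based style retrieval in the spirit of FreeStyleTTS: the annotated reference utterances are listed in a prompt and the LLM is asked to return the index of the best-matching style. The catalog, prompt template, and query_llm() stub are illustrative assumptions.

```python
# Minimal sketch of using an LLM as a style retriever.
references = [
    {"id": 0, "style": "gentle, soothing bedtime narration", "wav": "ref_calm.wav"},
    {"id": 1, "style": "fast, excited sports commentary", "wav": "ref_excited.wav"},
    {"id": 2, "style": "deep, ominous villain voice", "wav": "ref_villain.wav"},
]


def query_llm(prompt: str) -> str:
    """Stand-in for an instruction-following LLM; returns the chosen index."""
    return "2"   # canned answer for the example request below


def retrieve_style(request: str) -> dict:
    catalog = "\n".join(f"{r['id']}: {r['style']}" for r in references)
    prompt = (
        "Pick the single reference style that best matches the request.\n"
        f"Styles:\n{catalog}\nRequest: {request}\nAnswer with the index only:"
    )
    choice = int(query_llm(prompt).strip())
    return references[choice]


print(retrieve_style("Narrate this line as a menacing sorcerer.")["wav"])
```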